ST405/ST645 Bayesian Data Analysis
Tutorial Questions (3)
Solutions

Question: Bayesian Regression Model - Fisheries Data

- Visualize the Snapper Length Data

ggplot(fish_dat, aes(x = snapper_length)) +
  geom_histogram(bins = 40, colour = "blue")
- Fit a Basic Bayesian Model
- A Bayesian model that assumes a Normal likelihood for the length data, with an overall mean and variance (ignoring age groups for now).
## model spec
normmodel <- "
model{
  for (i in 1:n){
    y.i[i] ~ dnorm(mu, sigma^-2)  # JAGS dnorm is parameterized by precision, hence sigma^-2
  }
  ## priors
  mu ~ dnorm(0, 100^-2)  # vague prior
  sigma ~ dunif(0, 30)   # assumes the variation does not go beyond 30
  ## replicated data for posterior predictive checks
  for (i in 1:n){ yrep[i] ~ dnorm(mu, sigma^-2) }
}
"
## data and parameters
jags.data <- list(y.i = fish_dat$snapper_length,
                  n = nrow(fish_dat))
parnames <- c("mu","sigma","yrep")
## run JAGS
mod <- jags(data = jags.data,
            parameters.to.save = parnames,
            model.file = textConnection(normmodel),
            n.iter = 5000,
            n.burnin = 1000,
            n.thin = 2)

module glm loaded
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 256
Unobserved stochastic nodes: 258
Total graph size: 522
Initializing model
## output
m <- mod$BUGSoutput$sims.matrix

- Assess Model Convergence

plot(mod) # quick check for convergence

- Summarize Model Parameters
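Beyond the traceplots, R2jags also stores numerical diagnostics that can be checked directly. A quick sketch, assuming the `mod` object fitted above and more than one chain (so that Rhat is computed):

```r
## Rhat (potential scale reduction factor) is stored in the BUGSoutput
## summary matrix; values near 1 suggest convergence, and a common rule
## of thumb flags any parameter with Rhat > 1.1
rhats <- mod$BUGSoutput$summary[, "Rhat"]
summary(rhats)
any(rhats > 1.1)  # TRUE would flag parameters that may not have converged
```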
par_summary <- m %>%
gather_rvars(mu,sigma) %>%
median_qi(.value)
par_summary

# A tibble: 2 × 7
.variable .value .lower .upper .width .point .interval
<chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 mu 6.10 5.87 6.34 0.95 median qi
2 sigma 1.91 1.75 2.09 0.95 median qi
- Conduct Posterior Predictive Checks
(i) Compare Distributions
- Plot the observed distribution of \(y\) and overlay it with the distributions of 50 simulated datasets (\(y_{\text{rep}}\)) from the model.
y <- fish_dat$snapper_length
yrep <- mod$BUGSoutput$sims.list$yrep
ppc_dens_overlay(y, yrep[1:50, ])

This is OK, but not great: the mean of the yreps is shifted to the right, and the yreps place more density on the lower extremes than the observed data would suggest.
(ii) Evaluate Test Statistics
- Calculate the empirical values for the following test statistics based on the observed data:
- 1st percentile
- Median
- 97th percentile
- Compare these empirical values with the corresponding posterior predictive distributions of test statistics.
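The empirical values themselves can be computed directly in base R (this sketch assumes `y <- fish_dat$snapper_length` as defined below):

```r
## empirical test statistics from the observed data
obs_stats <- c(quantile(y, 0.01),       # 1st percentile
               median = median(y),      # median
               quantile(y, 0.97))       # 97th percentile
obs_stats
```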
## test statistics
ppc_stat(y, yrep, stat = "median")

The median is overestimated in the yreps relative to the observed data.
q1 <- function(y) quantile(y, 0.01) # create a function for the quantile (the ppc_stat help files give an example of this)
ppc_stat(y, yrep, stat = "q1")

The lower extremes of the yreps are mostly lower than the observed lower extreme (based on the 1st percentile).
q97 <- function(y) quantile(y, 0.97) # same but for the 97th percentile
ppc_stat(y, yrep, stat = "q97")

The upper extremes of the yreps are mostly lower than the observed upper extreme.
(iii) Posterior Predictive \(p\)-values and Effect Sizes
# do the median first
med_rep <- ppc_stat(y, yrep, stat = "median") # the data in this object contains the median for each yrep
## p-value (median)
sum(med_rep$data$value >= median(fish_dat$snapper_length))/length(med_rep$data$value)

[1] 0.9931667

There's a 99.3% chance of seeing a median as or more extreme than what was observed, based on this model.
## effect size (median)
(median(fish_dat$snapper_length) - median(med_rep$data$value))/sd(med_rep$data$value)

[1] -2.508263

The effect size is large, reflecting what we saw in the posterior predictive probability.
# 1st percentile
perc_1_rep <- ppc_stat(y, yrep, stat = "q1") # the data in this object contains the 1st percentile for each yrep
## p-value (1st percentile)
1 - sum(perc_1_rep$data$value >= quantile(fish_dat$snapper_length, probs = 0.01))/length(perc_1_rep$data$value)

[1] 0.997

There's a 99.7% chance of seeing a 1st percentile as or more extreme than what was observed, based on this model. So the model has lower extremes beyond what is suggested by the observed data.
## effect size (1st percentile)
(quantile(fish_dat$snapper_length, probs = 0.01) - median(perc_1_rep$data$value))/sd(perc_1_rep$data$value)

      1%
2.276166

A large effect size again.
# 97th percentile
perc_97_rep <- ppc_stat(y, yrep, stat = "q97") # the data in this object contains the 97th percentile for each yrep
## p-value (97th percentile)
sum(perc_97_rep$data$value >= quantile(fish_dat$snapper_length, probs = 0.97))/length(perc_97_rep$data$value)

There's only a small chance (around 0.5%) of seeing a 97th percentile as or more extreme than what was observed, based on this model. The model might underestimate the upper extremes.
## effect size (97th percentile)
(quantile(fish_dat$snapper_length, probs = 0.97) - median(perc_97_rep$data$value))/sd(perc_97_rep$data$value)
- Fit a Group-Specific Bayesian Model
## Model spec
mixmodel <- "
model{
  for (i in 1:n){
    y.i[i] ~ dnorm(mu.g[group[i]], sigma.g[group[i]]^-2)
  }
  ## priors
  for (g in 1:ng){
    mu.g[g] ~ dnorm(0, 100^-2)
    sigma.g[g] ~ dunif(0, 30)
  }
  ## replicated data for posterior predictive checks
  for (i in 1:n){ yrep[i] ~ dnorm(mu.g[group[i]], sigma.g[group[i]]^-2) }
}
"
## data and parameters
jags.data <- list(y.i = fish_dat$snapper_length,
                  n = nrow(fish_dat),
                  group = fish_dat$group,
                  ng = 2)
parnames <- c("mu.g","sigma.g","yrep")
## run JAGS
mod <- jags(data = jags.data,
            parameters.to.save = parnames,
            model.file = textConnection(mixmodel),
            n.iter = 5000,
            n.burnin = 1000,
            n.thin = 2)

Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 256
Unobserved stochastic nodes: 260
Total graph size: 782
Initializing model
## output
m <- mod$BUGSoutput$sims.matrix
plot(mod)

# (i)
y <- fish_dat$snapper_length
yrep <- mod$BUGSoutput$sims.list$yrep
ppc_dens_overlay(y, yrep[1:50, ])

This is a lot better than the 1st model.
# (ii)
## test statistics
ppc_stat(y, yrep, stat = "median")
q1 <- function(y) quantile(y, 0.01)
ppc_stat(y, yrep, stat = "q1")
q97 <- function(y) quantile(y, 0.97)
ppc_stat(y, yrep, stat = "q97")

All of these have improved.
# (iii)
med_rep <- ppc_stat(y, yrep, stat = "median") # the data in this object contains the median for each yrep
## p-value
sum(med_rep$data$value >= median(fish_dat$snapper_length))/length(med_rep$data$value)

[1] 0.5486667

There's a 55% chance of seeing a median as or more extreme than what was observed, based on this model. This is an improvement on the 1st model.
## effect size
(median(fish_dat$snapper_length) - median(med_rep$data$value))/sd(med_rep$data$value)

[1] -0.11525

The effect size is much smaller.
Notes

The posterior predictive \(p\)-value indicates the proportion of simulated datasets whose test statistic is as extreme as, or more extreme than, the value observed in the empirical data. It reflects the degree to which the model can replicate observed data characteristics.

The effect size quantifies the distance between the observed test statistic and the median of the posterior predictive distribution, in standard deviation units. This gives an indication of how unusual the observed data are relative to the model's predictions.
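The two quantities described above can be illustrated with a small self-contained sketch, using simulated draws in place of the real yrep output (the shift between the "observed" and "replicated" distributions here is invented purely for illustration):

```r
## stand-in observed data and replicated datasets (1000 replicates of n = 200),
## with the replicates deliberately shifted to the right of the observed data
set.seed(1)
y_obs <- rnorm(200, mean = 6, sd = 2)
yrep  <- matrix(rnorm(1000 * 200, mean = 6.5, sd = 2), nrow = 1000)

t_obs <- median(y_obs)            # observed test statistic
t_rep <- apply(yrep, 1, median)   # test statistic for each replicate

## posterior predictive p-value: proportion of replicates at least as extreme
p_val <- mean(t_rep >= t_obs)

## effect size: distance from the centre of the replicate distribution, in SDs
eff <- (t_obs - median(t_rep)) / sd(t_rep)

p_val  # close to 1: nearly all replicate medians exceed the observed one
eff    # negative and large in magnitude: the observed median sits well below
```

Because the replicates were shifted upward, the replicate medians almost all exceed the observed median, giving a \(p\)-value near 1 and a large negative effect size, the same pattern seen for the median in the first model above.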